Abstract

Given a set of points $P \subset \mathbb{R}^d$, the k-means clustering problem is to find
a set of k centers $C = \{c_1, \ldots, c_k\}$, $c_i \in \mathbb{R}^d$, such that the objective function
$\sum_{x \in P} d(x, C)^2$, where $d(x, C)$ denotes the distance between x and the
closest center in C, is minimized. This is one of the most prominent
objective functions that have been studied with respect to clustering.

$D^2$-sampling [7] is a simple non-uniform sampling technique for choosing
points from a set of points. It works as follows: given a set of points
$P \subset \mathbb{R}^d$, the first point is chosen uniformly at random from P.
Subsequently, a point from P is chosen as the next sample with probability
proportional to the square of the distance of this point to the nearest
previously sampled point.

$D^2$-sampling has been shown to have nice properties with respect to
the k-means clustering problem. Arthur and Vassilvitskii [7] show that
k points chosen as centers from P using $D^2$-sampling give an $O(\log k)$
approximation in expectation. Ailon et al. [5] and Aggarwal et al. [3]
extended the results of [7] to show that $O(k)$ points chosen as centers using
$D^2$-sampling give an $O(1)$ approximation to the k-means objective function with
high probability. In this paper, we further demonstrate the power of
$D^2$-sampling by giving a simple randomized $(1 + \epsilon)$-approximation algorithm
that uses $D^2$-sampling at its core.



1 Introduction 



Clustering problems arise in diverse areas including machine learning, data
mining, image processing and web-search [TTJ [16j H3 US]. One of the most commonly
used clustering problems is the k-means problem. Here, we are given a set of
points P in a d-dimensional Euclidean space, and a parameter k. The goal is to
find a set C of k centers such that the objective function

\[ \Delta(P, C) = \sum_{p \in P} d(p, C)^2 \]

is minimized, where $d(p, C)$ denotes the distance from p to the closest center in
C. This naturally partitions P into k clusters, where each cluster corresponds to
the set of points of P which are closer to a particular center than to the other centers.
It is also easy to show that the center of any cluster must be the mean of the
points in it. In most applications, the parameter k is a small constant. However,
this problem turns out to be NP-hard even for k = 2 [T3].
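As a concrete illustration of the objective, the following minimal Python sketch evaluates the cost of a candidate set of centers and the partition it induces. The function names `kmeans_cost` and `induced_clusters` are hypothetical helpers introduced here for illustration only; points are assumed to be tuples of floats.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)


def induced_clusters(points, centers):
    """Partition of the points induced by assigning each to its closest center."""
    parts = {j: [] for j in range(len(centers))}
    for p in points:
        j = min(range(len(centers)), key=lambda idx: dist(p, centers[idx]))
        parts[j].append(p)
    return parts


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)          # every point is at distance 1
parts = induced_clusters(points, centers)    # two clusters of two points each
```

In this toy instance each center is the mean of its cluster, in line with the observation above that the center of any cluster must be the mean of its points.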

One very popular heuristic for solving the k-means problem is Lloyd's
algorithm [25]. The heuristic is as follows: start with an arbitrary set of k
centers as seeds. Based on these k centers, partition the set of points into k
clusters, where each point gets assigned to the closest center. Now, we update
the set of centers to be the means of these clusters. This process is repeated
until convergence. Although this heuristic often performs well in practice,
it is known that it can get stuck in local minima [6]. There has been a lot of
recent research in understanding why this heuristic works fast in practice, and
how it can be modified such that we can guarantee that the solution produced
by this heuristic is always close to the optimal solution.
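The heuristic just described can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the names `mean` and `lloyd`, the iteration cap `max_iter`, and the convention that an empty cluster keeps its old center are all assumptions made here.

```python
from math import dist


def mean(pts):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    d = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))


def lloyd(points, seeds, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(c) for c in seeds]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest current center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda idx: dist(p, centers[idx]))
            groups[j].append(p)
        # Update step: an empty cluster keeps its old center (assumed convention).
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Corners of a 4 x 1 rectangle: the best 2-clustering splits left from right.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = lloyd(corners, [(0.0, 0.0), (4.0, 0.0)])
stuck = lloyd(corners, [(2.0, 0.0), (2.0, 1.0)])
```

With the second choice of seeds the procedure stops immediately at a bottom/top split of cost 16, whereas the left/right split found from the first seeds has cost 1. This is a small instance of the local-minimum behaviour mentioned above.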

One such modification is to carefully choose the set of initial k centers.
Ideally, we would like to pick these centers such that we have a center close to each
of the optimal clusters. Since we do not know the optimal clustering, we would
like to make sure that these centers are well separated from each other and yet
are representative of the set of points. A recently proposed idea [231 13 is to pick
the initial centers using $D^2$-sampling, which can be described as follows. The
first center is picked uniformly at random from the set of points P. Suppose
we have picked a set of k' < k centers; call this set C'. Then a point $p \in P$
is chosen as the next center with probability proportional to $d(p, C')^2$. This
process is repeated till we have a set of k centers.
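The seeding procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `d2_seeding` and the `rng` parameter are introduced here, and the input is assumed to contain at least k distinct points so that the sampling weights never all vanish.

```python
import random
from math import dist


def d2_seeding(points, k, rng=random):
    """Pick k centers: the first uniformly, each later one with probability
    proportional to the squared distance to the closest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # random.choices normalizes the weights; already-chosen points have weight 0.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


rng = random.Random(0)  # fixed seed only to make the example repeatable
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 0.0)]
seeds = d2_seeding(pts, 3, rng)
```

Because a point that is already a center has weight zero, the procedure favours points far from the current centers, which is what pushes the seeds towards distinct clusters.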

There has been a lot of recent activity in understanding how good a set of
centers picked by $D^2$-sampling is (even if we do not run Lloyd's algorithm
on these seed centers). Arthur and Vassilvitskii [7] showed that if we pick k
centers with $D^2$-sampling, then the expected cost of the corresponding solution
to the k-means instance is within an $O(\log k)$-factor of the optimal value. Ostrovsky
et al. [24] showed that if the set of points satisfies a separation condition (named
$(\epsilon^2, k)$-irreducible, as defined in Section 2), then these k centers give a constant
factor approximation for the k-means problem. Ailon et al. [5] proved a
bi-criteria approximation property: if we pick $O(k \log k)$ centers by $D^2$-sampling,
then it is a constant approximation, where we compare with the optimal solution
that is allowed to pick k centers only. Aggarwal et al. [3] give an improved
result and show that it is enough to pick $O(k)$ centers by $D^2$-sampling to get a
constant factor bi-criteria approximation algorithm.

In this paper, we give yet another illustration of the power of the
$D^2$-sampling idea. We give a simple randomized $(1 + \epsilon)$-approximation algorithm
for the k-means problem, where $\epsilon > 0$ is an arbitrarily small constant. At the
heart of our algorithm is the idea of $D^2$-sampling: given a set of already selected
centers, we pick a small set of points by $D^2$-sampling with respect to these
selected centers. Then, we pick the next center as the centroid of a subset of this
small set of points. By repeating this process of picking k centers sufficiently
many times, we can guarantee that with high probability, we will get a set of k
centers whose objective value is close to the optimal value. Further, the running
time of our algorithm is $O(nd \cdot 2^{\tilde{O}(k^2/\epsilon)})$; for constant value of k, this is a linear
time algorithm. It is important to note that PTAS with better running time are
known for this problem. Chen [12] gives an $O(nkd + d^2 n^{\sigma} \cdot 2^{(k/\epsilon)^{O(1)}})$ algorithm
for any $\sigma > 0$ and Feldman et al. [IX give an $O(nkd + d \cdot \mathrm{poly}(k/\epsilon) + 2^{\tilde{O}(k/\epsilon)})$
algorithm. However, these results are often quite involved, and use the notion
of coresets. Our algorithm is simple, and only uses the concept of $D^2$-sampling.

1 The $\tilde{O}$ notation hides an $O(\log k/\epsilon)$ factor, which simplifies the expression.

1.1 Other Related Work 

There has been significant research on exactly solving the k-means problem
(see e.g., [20]), but all of these algorithms take $\Omega(n^{kd})$ time. Hence, recent
research on this problem has focused on obtaining fast $(1 + \epsilon)$-approximation
algorithms for any $\epsilon > 0$. Matousek [23] gave a PTAS with running time
$O(n \epsilon^{-2k^2 d} \log^k n)$. Badoiu et al. gave an improved PTAS with running
time $O(2^{(k/\epsilon)^{O(1)}} d^{O(1)} n \log^{O(k)} n)$. de la Vega et al. Qj] gave a PTAS which
works well for points in high dimensions. The running time of this algorithm is
$O(g(k, \epsilon)\, n \log^k n)$ where $g(k, \epsilon) = \exp[(k^3/\epsilon^8)(\ln(k/\epsilon)) \ln k]$. Har-Peled et al. [T5]
proposed a PTAS whose running time is $O(n + k^{k+2} \epsilon^{-(2d+1)k} \log^{k+1} n \log^k \frac{1}{\epsilon})$.
Kumar et al. [21] gave the first linear time PTAS for fixed k; the running time of
their algorithm is $O(2^{(k/\epsilon)^{O(1)}} dn)$. Chen [12] used a new coreset construction
to give a PTAS with improved running time of $O(ndk + 2^{(k/\epsilon)^{O(1)}} d^2 n^{\sigma})$. Recently,
Feldman et al. [UJ gave a PTAS with running time $O(nkd + d \cdot \mathrm{poly}(k/\epsilon) + 2^{\tilde{O}(k/\epsilon)})$;
this is the fastest known PTAS (for fixed k) for this problem.

There has also been work on obtaining fast constant factor approximation
algorithms for the k-means problem based on some properties of the input points
(see e.g. H|).

1.2 Our Contributions 

In this paper, we give a simple PTAS for the k-means problem based on the idea
of $D^2$-sampling. Our work builds on and simplifies the result of Kumar et al. [21].
We briefly describe their algorithm first. It is well known that for the 1-mean
problem, if we sample a set of $O(1/\epsilon)$ points uniformly at random, then the mean
of this set of sampled points is close to the overall mean of the set of all points.
Their algorithm begins by sampling $O(k/\epsilon)$ points uniformly at random. With
reasonable probability, we would sample $O(1/\epsilon)$ points from the largest cluster,
and hence we could get a good approximation to the center corresponding to
this cluster (their algorithm tries all subsets of size $O(1/\epsilon)$ from the randomly
sampled points). However, the other clusters may be much smaller, and we may
not have sampled enough points from them. So, they need to prune a lot of
points from the largest cluster so that in the next iteration a random sample of
$O(k/\epsilon)$ points will contain $O(1/\epsilon)$ points from the second largest cluster, and
so on. This requires a non-trivial idea termed the tightness condition by the
authors. In this paper, we show that the pruning is not necessary if, instead of
using uniform random sampling, one uses $D^2$-sampling.

We can informally describe our algorithm as follows. We maintain a set of
candidate centers C, which is initially empty. Given a set C, $|C| < k$, we add
a new center to C as follows. We sample a set S of $O(k/\epsilon^3)$ points using
$D^2$-sampling with respect to C. From this set of sampled points, we pick a subset
T and the new center is the mean of this set T. We add this to C and continue.
From the property of $D^2$-sampling ([U[5]), with some constant, albeit small,
probability p', we pick up a point from a hitherto untouched cluster C' of the
optimal clustering. Therefore by sampling about $\alpha/p'$ points using $D^2$-sampling,
we expect to hit approximately $\alpha$ points from C'. If $\alpha$ is large enough (c.f.
Lemma 3), then the centroid of these $\alpha$ points gives a $(1 + \epsilon)$ approximation of
the cluster C'. Therefore, with reasonable probability, there will be a choice of a
subset T in each iteration such that the set of centers chosen are from C'. Since
we do not know T, our algorithm will try out all subsets of size $|T|$ from the
sample S. Note that our algorithm is very simple, and can be easily parallelized.
Our algorithm has running time $O(dn \cdot 2^{\tilde{O}(k^2/\epsilon)})$, which is an improvement over
that of Kumar et al. [21], who gave a PTAS with running time $O(nd \cdot 2^{(k/\epsilon)^{O(1)}})$.

Because of its relative simplicity, our algorithm generalizes to measures like
Mahalanobis distance and $\mu$-similar Bregman divergence. Note that these do
not satisfy the triangle inequality and are therefore not strict metrics. Ackermann et al.
[2] have generalized the framework of Kumar et al. [21] to Bregman divergences,
but we feel that the $D^2$-sampling based algorithms are simpler.

We formally define the problem and give some preliminary results in
Section 2. In Section 3 we describe our algorithm, and then analyze it subsequently.
In Section 4 we discuss PTAS for other distance measures.



2 Preliminaries 

An instance of the k-means problem consists of a set $P \subset \mathbb{R}^d$ of n points in
d-dimensional space and a parameter k. For a set of points (called centers)
$C \subset \mathbb{R}^d$, let $\Delta(P, C)$ denote $\sum_{p \in P} d(p, C)^2$, i.e., the cost of the solution which
picks C as the set of centers. For a singleton $C = \{c\}$, we shall often abuse
notation, and use $\Delta(P, c)$ to denote $\Delta(P, C)$. Let $\Delta_k(P)$ denote the cost of the
optimal k-means solution for P.

Definition 1. Given a set of points P and a set of centers C, a point $p \in P$ is
said to be sampled using $D^2$-sampling with respect to C if the probability of it
being sampled, $\rho(p)$, is given by

\[ \rho(p) = \frac{d(p, C)^2}{\sum_{x \in P} d(x, C)^2} = \frac{\Delta(\{p\}, C)}{\Delta(P, C)}. \]

We will also need the following definition from [21].

Definition 2 (Irreducibility or separation condition). Given k and $\gamma$, a set of
points P is said to be $(k, \gamma)$-irreducible if

\[ \Delta_{k-1}(P) \geq (1 + \gamma) \cdot \Delta_k(P). \]

We will often appeal to the following result [20], which shows that uniform
random sampling works well for 1-means.

2 Our algorithm can be used in conjunction with Chen [12] to obtain a superior running
time, but at the cost of the simplicity of our approach.

3 It turns out that even minor perturbations from the uniform distribution can be catastrophic,
and indeed in this paper we had to work around this.
Lemma 3 (Inaba et al. [20]). Let S be a set of points obtained by independently
sampling M points with replacement uniformly at random from a point set P.
Then, for any $\delta > 0$,

\[ \Delta(P, \{m(S)\}) \leq \left(1 + \frac{1}{\delta M}\right) \cdot \Delta(P, \{m(P)\}) \]

holds with probability at least $(1 - \delta)$. Here $m(X) = \frac{\sum_{x \in X} x}{|X|}$ denotes the
centroid of a point set X.
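A quick empirical check of the statement of Lemma 3 can be done with a short simulation. The sketch below is an illustration only: the helper names (`mean`, `cost`, `inaba_success_rate`), the Gaussian test data, and the parameter values are assumptions chosen here, not part of the paper.

```python
import random


def mean(pts):
    """Centroid of a non-empty list of points."""
    d = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))


def cost(points, c):
    """1-means cost of the single center c: sum of squared distances."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)


def inaba_success_rate(points, M, delta, trials, rng):
    """Fraction of trials in which the mean of M uniform samples (with
    replacement) has cost at most (1 + 1/(delta*M)) times the optimal cost."""
    bound = (1 + 1 / (delta * M)) * cost(points, mean(points))
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(points) for _ in range(M)]
        if cost(points, mean(sample)) <= bound:
            hits += 1
    return hits / trials


rng = random.Random(1)  # fixed seed only for repeatability
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
rate = inaba_success_rate(data, M=20, delta=0.25, trials=200, rng=rng)
```

The lemma guarantees a success probability of at least $1 - \delta = 0.75$ for these parameters; on typical data the observed rate is noticeably higher, since the bound is obtained from a Markov-type argument.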

We will also use the following property of the squared Euclidean metric.
This is a standard result from linear algebra [19].

Lemma 4. Let $P \subset \mathbb{R}^d$ be any point set and let $c \in \mathbb{R}^d$ be any point. Then we
have the following:

\[ \sum_{p \in P} d(p, c)^2 = \sum_{p \in P} d(p, m(P))^2 + |P| \cdot d(c, m(P))^2, \]

where $m(P) = \frac{\sum_{p \in P} p}{|P|}$ denotes the centroid of the point set.
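The identity of Lemma 4 is easy to check numerically. The following sketch evaluates both sides for a small point set; the helper names `sq_dist`, `centroid` and `lemma4_sides`, and the example points, are chosen here purely for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))


def lemma4_sides(points, c):
    """Return (left-hand side, right-hand side) of the centroid identity."""
    m = centroid(points)
    lhs = sum(sq_dist(p, c) for p in points)
    rhs = sum(sq_dist(p, m) for p in points) + len(points) * sq_dist(c, m)
    return lhs, rhs


pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]   # centroid is (1, 1)
lhs, rhs = lemma4_sides(pts, (4.0, -1.0))    # 17 + 5 + 25 = 8 + 3 * 13
```

Both sides evaluate to 47 here: the cost with respect to an arbitrary center splits into the cost with respect to the centroid plus a penalty that grows with the squared distance between the center and the centroid.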

Finally, we mention the simple approximate triangle inequality with respect
to the squared Euclidean distance measure.

Lemma 5 (Approximate triangle inequality). For any three points $p, q, r \in \mathbb{R}^d$
we have:

\[ d(p, q)^2 \leq 2 \cdot \left(d(p, r)^2 + d(r, q)^2\right). \]
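For completeness, the inequality of Lemma 5 follows in one line from the ordinary triangle inequality for the Euclidean distance together with the elementary bound $2ab \leq a^2 + b^2$; the derivation below is a standard argument spelled out here and is not quoted from the paper.

```latex
\begin{align*}
d(p,q)^2 &\le \bigl(d(p,r) + d(r,q)\bigr)^2 \\
         &= d(p,r)^2 + 2\,d(p,r)\,d(r,q) + d(r,q)^2 \\
         &\le 2\,\bigl(d(p,r)^2 + d(r,q)^2\bigr).
\end{align*}
```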

3 PTAS for k-means

We first give a high level description of the algorithm. We will also assume
that the instance is $(k, \epsilon)$-irreducible for a suitably small parameter $\epsilon$. We shall
then get rid of this assumption later. The algorithm is described in Figure 1.
Essentially, the algorithm maintains a set C of centers, where $|C| \leq k$. Initially
C is empty, and in each iteration of Step 2(b), it adds one center to C till its
size reaches k. Given a set C, it samples a set S of points from P using
$D^2$-sampling with respect to C (in Step 2(b)). Then it picks a subset T of S of size
$M = O(1/\epsilon)$, and adds the centroid of T to C. The algorithm cycles through all
possible subsets of size M of S as choices for T, and for each such choice, repeats
the above steps to find the next center, and so on. To make the presentation
clearer, we pick a k-tuple of M-size subsets $(s_1, \ldots, s_k)$ in advance, and when
$|C| = i$, we pick T as the $s_i$-th subset of S. In Step 2(i), we cycle through all
such k-tuples $(s_1, \ldots, s_k)$. In the analysis, we just need to show that one such
k-tuple works with reasonable probability.

We develop some notation first. For the rest of the analysis, we will fix a
tuple $(s_1, \ldots, s_k)$; this will be the "desired tuple", i.e., the one for which we
can show that the set C gives a good solution. As our analysis proceeds, we
will argue what properties this tuple should have. Let $C^{(i-1)}$ be the set C at the
beginning of the i-th iteration of Step 2(b). To begin with, $C^{(0)}$ is empty. Let $S^{(i)}$
be the set S sampled during the i-th iteration of Step 2(b), and $T^{(i)}$ be the
corresponding set T (which is the $s_i$-th subset of $S^{(i)}$).
Find-k-means(P)

    Let $N = 51200 \cdot k/\epsilon^3$, $M = 100/\epsilon$, and $P = \binom{N}{M}$

    1. Repeat $2^k$ times and output the set of centers C that gives the least cost
    2.   Repeat for all k-tuples $(s_1, \ldots, s_k) \in [P] \times [P] \times \cdots \times [P]$ and
         pick the set of centers C that gives the least cost
         (a) $C \leftarrow \{\}$
         (b) For $i \leftarrow 1$ to k
                 Sample a set S of N points with $D^2$-sampling (w.r.t. centers C)
                 Let T be the $s_i$-th subset of S. [a]
                 $C \leftarrow C \cup \{m(T)\}$. [b]

    [a] For a set of size N we consider an arbitrary ordering of the subsets of size M of this set.
    [b] m(T) denotes the centroid of the points in T.

Figure 1: The k-means algorithm that gives a $(1 + \epsilon)$-approximation for any
$(k, \epsilon)$-irreducible data set. Note that the inner loop is executed at most
$2^k \cdot \binom{N}{M}^k \sim 2^k \cdot 2^{\tilde{O}(k/\epsilon)}$ times.
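To make the control flow of Figure 1 concrete, here is a small Python sketch of the same loop structure. It is an illustration under several assumptions rather than the authors' implementation: `N`, `M` and the number of repetitions are passed in by the caller (the constants of Figure 1 would make the enumeration astronomically large), $D^2$-sampling with respect to an empty set of centers is taken to be uniform sampling, and all function names are hypothetical.

```python
import random
from itertools import combinations, product
from math import comb, dist


def centroid(pts):
    d = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))


def cost(points, centers):
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)


def d2_sample(points, centers, n, rng):
    """n independent D^2-samples w.r.t. centers (uniform if there are none)."""
    if centers:
        w = [min(dist(p, c) ** 2 for c in centers) for p in points]
    else:
        w = [1.0] * len(points)
    if sum(w) == 0:  # every point coincides with a center: fall back to uniform
        w = [1.0] * len(points)
    return rng.choices(points, weights=w, k=n)


def find_k_means(points, k, N, M, repeats, rng):
    """Loop structure of Figure 1 with caller-supplied N, M and repeat count."""
    best, best_cost = None, float("inf")
    num_subsets = comb(N, M)                      # the quantity called P in Figure 1
    for _ in range(repeats):                      # Step 1
        for tup in product(range(num_subsets), repeat=k):   # Step 2
            C = []                                # Step 2(a)
            for s_i in tup:                       # Step 2(b)
                S = d2_sample(points, C, N, rng)
                T = list(combinations(S, M))[s_i]  # the s_i-th size-M subset of S
                C.append(centroid(T))
            c = cost(points, C)
            if c < best_cost:
                best, best_cost = C, c
    return best, best_cost


rng = random.Random(0)
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best, best_cost = find_k_means(pts, k=2, N=4, M=2, repeats=2, rng=rng)
```

With `N=4` and `M=2` there are $\binom{4}{2} = 6$ subsets per iteration, hence $6^2 = 36$ tuples per repetition. This makes the exponential dependence on k and $1/\epsilon$ in the caption of Figure 1 tangible, while the dependence on the number of points stays linear.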

Let $O_1, \ldots, O_k$ be the optimal clusters, and $c_i$ denote the centroid of the points
in $O_i$. Further, let $m_i$ denote $|O_i|$, and wlog assume that $m_1 \geq \ldots \geq m_k$. Note
that $\Delta_1(O_i)$ is the same as $\Delta(O_i, \{c_i\})$. Let $r_i$ denote the average cost paid by a
point in $O_i$, i.e.,

\[ r_i = \frac{\sum_{p \in O_i} d(p, c_i)^2}{m_i}. \]

We will assume that the input set of points P is $(k, \epsilon)$-irreducible. We shall
remove this assumption later. Now we show that any two optimal centers are
far enough apart.

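Lemma 6. For any $1 \leq i, j \leq k$, $i \neq j$,

\[ d(c_i, c_j)^2 \geq \epsilon \cdot (r_i + r_j). \]

Proof. Suppose $i < j$, and hence $m_i \geq m_j$. For the sake of contradiction assume
$d(c_i, c_j)^2 < \epsilon \cdot (r_i + r_j)$. Then we have,

\begin{align*}
\Delta(O_i \cup O_j, \{c_i\}) &= m_i \cdot r_i + m_j \cdot r_j + m_j \cdot d(c_i, c_j)^2 && \text{(using Lemma 4)} \\
&< m_i \cdot r_i + m_j \cdot r_j + m_j \cdot \epsilon \cdot (r_i + r_j) \\
&\leq (1 + \epsilon) \cdot m_i \cdot r_i + (1 + \epsilon) \cdot m_j \cdot r_j && \text{(since } m_i \geq m_j\text{)} \\
&\leq (1 + \epsilon) \cdot \Delta(O_i \cup O_j, \{c_i, c_j\}).
\end{align*}

This implies that the centers $\{c_1, \ldots, c_k\} \setminus \{c_j\}$ give a $(1 + \epsilon)$-approximation to the
k-means objective. This contradicts the assumption that P is $(k, \epsilon)$-irreducible.

□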

We give an outline of the proof. Suppose in the first $i - 1$ iterations, we have
found centers which are close to the centers of some $i - 1$ clusters in the optimal
solution. Conditioned on this fact, we show that in the next iteration, we are
likely to sample a sufficient number of points from one of the remaining clusters
(c.f. Corollary 8). Further, we show that the samples from this new cluster
are close to the uniform distribution (c.f. Lemma 9). Since such a sample does
not come from an exactly uniform distribution, we cannot apply Lemma 3 directly.
In fact, dealing with the slight non-uniformity turns out to be non-trivial (c.f.
Lemmas 10 and 11).

We now show that the following invariant will hold for all iterations: let
$C^{(i-1)}$ consist of centers $c'_1, \ldots, c'_{i-1}$ (added in this order). Then, with
probability at least $2^{-(i-1)}$, there exist distinct indices $j_1, \ldots, j_{i-1}$ such that for all
$l = 1, \ldots, i-1$,

\[ \Delta(O_{j_l}, c'_l) \leq (1 + \epsilon/20) \cdot \Delta(O_{j_l}, c_{j_l}) \qquad (1) \]

Suppose this invariant holds for $C^{(i-1)}$ (the base case is easy since $C^{(0)}$ is empty).
We now show that this invariant holds for $C^{(i)}$ as well. In other words, we just
show that in the i-th iteration, with probability at least 1/2, the algorithm finds
a center $c'_i$ such that

\[ \Delta(O_{j_i}, c'_i) \leq (1 + \epsilon/20) \cdot \Delta(O_{j_i}, c_{j_i}), \]

where $j_i$ is an index distinct from $\{j_1, \ldots, j_{i-1}\}$. This will basically show that
at the end of the last iteration, we will have k centers that give a
$(1 + \epsilon)$-approximation with probability at least $2^{-k}$.

We now show that the invariant holds for $C^{(i)}$. We use the notation
developed above for $C^{(i-1)}$. Let I denote the set of indices $\{j_1, \ldots, j_{i-1}\}$. Now let
$j_i$ be the index $j \notin I$ for which $\Delta(O_j, C^{(i-1)})$ is maximum. Intuitively,
conditioned on sampling from clusters in $O_i, \ldots, O_k$ using $D^2$-sampling, it is likely
that enough points from $O_{j_i}$ will be sampled. The next lemma shows that there
is a good chance that elements from the sets $O_j$ for $j \notin I$ will be sampled.

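Lemma 7.

\[ \frac{\sum_{l \notin I} \Delta(O_l, C^{(i-1)})}{\sum_{l=1}^{k} \Delta(O_l, C^{(i-1)})} \geq \epsilon/2. \]

Proof. Suppose, for the sake of contradiction, the above statement does not
hold. Then,

\begin{align*}
\Delta(P, C^{(i-1)}) &= \sum_{l \in I} \Delta(O_l, C^{(i-1)}) + \sum_{l \notin I} \Delta(O_l, C^{(i-1)}) \\
&< \sum_{l \in I} \Delta(O_l, C^{(i-1)}) + \frac{\epsilon/2}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta(O_l, C^{(i-1)}) && \text{(by our assumption)} \\
&= \frac{1}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta(O_l, C^{(i-1)}) \\
&\leq \frac{1 + \epsilon/20}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta_1(O_l) && \text{(using the invariant for } C^{(i-1)}\text{)} \\
&\leq (1 + \epsilon) \cdot \sum_{l \in I} \Delta_1(O_l) \leq (1 + \epsilon) \cdot \sum_{l \in [k]} \Delta_1(O_l).
\end{align*}

But this contradicts the fact that P is $(k, \epsilon)$-irreducible. □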
We get the following corollary easily.

Corollary 8.

\[ \frac{\Delta(O_{j_i}, C^{(i-1)})}{\sum_{l=1}^{k} \Delta(O_l, C^{(i-1)})} \geq \frac{\epsilon}{2k}. \]
The above Lemma and its Corollary say that with probability at least $\frac{\epsilon}{2k}$,
points in the set $O_{j_i}$ will be sampled. However, the points within $O_{j_i}$ are not
sampled uniformly. Some points in $O_{j_i}$ might be sampled with higher probability
than other points. In the next lemma, we show that each point will be sampled
with a certain minimum probability.

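Lemma 9. For any $l \notin I$ and any point $p \in O_l$,

\[ \frac{d(p, C^{(i-1)})^2}{\Delta(O_l, C^{(i-1)})} \geq \frac{1}{m_l} \cdot \frac{\epsilon}{64}. \]

Proof. Fix a point $p \in O_l$. Let $j_t \in I$ be the index such that p is closest to $c'_t$
among all centers in $C^{(i-1)}$. We have

\begin{align*}
\Delta(O_l, C^{(i-1)}) &\leq m_l \cdot r_l + m_l \cdot d(c_l, c'_t)^2 && \text{(using Lemma 4)} \\
&\leq m_l \cdot r_l + 2 \cdot m_l \cdot \left(d(c_l, c_{j_t})^2 + d(c_{j_t}, c'_t)^2\right) && \text{(using Lemma 5)} \\
&\leq m_l \cdot r_l + 2 \cdot m_l \cdot \left(d(c_l, c_{j_t})^2 + \frac{\epsilon \cdot r_{j_t}}{20}\right), && (2)
\end{align*}

where the last inequality follows from the invariant condition for $C^{(i-1)}$ (together
with Lemma 4, the invariant gives $d(c_{j_t}, c'_t)^2 \leq \frac{\epsilon}{20} \cdot r_{j_t}$). Also, since $p \in O_l$ is at
least as close to $c_l$ as to $c_{j_t}$, we have $d(p, c_{j_t}) \geq d(c_{j_t}, c_l)/2$, and so

\begin{align*}
d(p, c'_t)^2 &\geq \frac{d(c_{j_t}, c_l)^2}{8} - d(c_{j_t}, c'_t)^2 && \text{(using Lemma 5)} \\
&\geq \frac{d(c_{j_t}, c_l)^2}{8} - \frac{\epsilon}{20} \cdot r_{j_t} && \text{(using the invariant for } C^{(i-1)}\text{)} \\
&\geq \frac{d(c_{j_t}, c_l)^2}{16} && \text{(using Lemma 6)} \qquad (3)
\end{align*}

So, we get

\begin{align*}
\frac{d(p, c'_t)^2}{\Delta(O_l, C^{(i-1)})} &\geq \frac{d(c_{j_t}, c_l)^2}{16 \cdot m_l \cdot \left(r_l + 2\left(d(c_{j_t}, c_l)^2 + \frac{\epsilon \cdot r_{j_t}}{20}\right)\right)} && \text{(using (2) and (3))} \\
&\geq \frac{1}{16 \cdot m_l} \cdot \frac{1}{(1/\epsilon) + 2 + 1/10} \geq \frac{\epsilon}{64 \cdot m_l} && \text{(using Lemma 6)}
\end{align*}

□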

Recall that $S^{(i)}$ is the sample of size N in this iteration. We would like to
show that the invariant will hold in this iteration as well. We first prove a
simple corollary of Lemma 3.

Lemma 10. Let Q be a set of n points, and $\gamma$ be a parameter, $0 < \gamma < 1$. Define
a random variable X as follows: with probability $\gamma$, it picks an element of Q
uniformly at random, and with probability $1 - \gamma$, it does not pick any element
(i.e., is null). Let $X_1, \ldots, X_\ell$ be $\ell$ independent copies of X, where $\ell = \frac{400}{\epsilon \gamma}$. Let
T denote the (multi-set) of elements of Q picked by $X_1, \ldots, X_\ell$. Then, with
probability at least 3/4, T contains a subset U of size $\frac{100}{\epsilon}$ which satisfies

\[ \Delta(Q, m(U)) \leq \left(1 + \frac{\epsilon}{20}\right) \Delta_1(Q) \qquad (4) \]
Proof. Define a random variable I, which is a subset of the index set $\{1, \ldots, \ell\}$,
as follows: $I = \{t : X_t \text{ picks an element of } Q\text{, i.e., it is not null}\}$. Conditioned
on $I = \{t_1, \ldots, t_r\}$, note that the random variables $X_{t_1}, \ldots, X_{t_r}$ are independent
uniform samples from Q. Thus if $|I| \geq \frac{100}{\epsilon}$, then Lemma 3 implies that with
probability at least 0.8, the desired event (4) happens. But the expected value
of $|I|$ is $\frac{400}{\epsilon}$, and so $|I| \geq \frac{100}{\epsilon}$ with high probability, and hence the statement
in the lemma is true. □

We are now ready to prove the main lemma. 

Lemma 11. With probability at least 1/2, there exists a subset $T^{(i)}$ of $S^{(i)}$ of
size at most $\frac{100}{\epsilon}$ such that

\[ \Delta(O_{j_i}, m(T^{(i)})) \leq \left(1 + \frac{\epsilon}{20}\right) \cdot \Delta_1(O_{j_i}). \]

Proof. Recall that $S^{(i)}$ contains $N = \frac{51200 k}{\epsilon^3}$ independent samples of P (using
$D^2$-sampling). We are interested in $S^{(i)} \cap O_{j_i}$. Let $Y_1, \ldots, Y_N$ be N independent
random variables defined as follows: for any t, $1 \leq t \leq N$, $Y_t$ picks an element
of P using $D^2$-sampling with respect to $C^{(i-1)}$. If this element is not in $O_{j_i}$, it
just discards it (i.e., $Y_t$ is null). Let $\gamma$ denote $\frac{\epsilon^2}{128 k}$. Corollary 8 and Lemma 9
imply that $Y_t$ picks a particular element of $O_{j_i}$ with probability at least $\frac{\gamma}{m_{j_i}}$. We
would now like to apply Lemma 10 (observe that $N = \frac{400}{\epsilon \gamma}$). We can do this by a
simple coupling argument as follows. For a particular element $p \in O_{j_i}$, suppose
$Y_t$ assigns probability $\frac{\gamma(p)}{m_{j_i}}$ to it. One way of sampling a random variable $X_t$ as
in Lemma 10 is as follows: first sample using $Y_t$. If $Y_t$ is null then $X_t$ is also
null. Otherwise, suppose $Y_t$ picks an element p of $O_{j_i}$. Then, $X_t$ is equal to p
with probability $\frac{\gamma}{\gamma(p)}$, null otherwise. It is easy to check that with probability $\gamma$,
$X_t$ is a uniform sample from $O_{j_i}$, and null with probability $1 - \gamma$. Now, observe
that the set of elements of $O_{j_i}$ sampled by $Y_1, \ldots, Y_N$ is always a superset of
$X_1, \ldots, X_N$. We can now use Lemma 10 to finish the proof. □
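The thinning step in the coupling argument can be checked with exact arithmetic. In the sketch below, `y_probs` plays the role of the distribution of $Y_t$ restricted to the cluster, and each pick p is kept with probability $(\gamma/m)/\Pr[Y_t = p]$. The function name `thinned_distribution` and the toy numbers are assumptions made for illustration.

```python
from fractions import Fraction


def thinned_distribution(y_probs, gamma):
    """Given Pr[Y = p] >= gamma/m for each of the m cluster elements p,
    return Pr[X = p] when a pick p is kept with probability (gamma/m)/Pr[Y = p]."""
    m = len(y_probs)
    floor = Fraction(gamma) / m
    x_probs = {}
    for p, pr in y_probs.items():
        assert pr >= floor, "coupling needs every element to have mass >= gamma/m"
        accept = floor / pr          # always in (0, 1] because pr >= floor
        x_probs[p] = pr * accept     # equals gamma/m regardless of pr
    return x_probs


# Toy cluster with three elements and a non-uniform Y; the remaining mass
# of Y (here 1/2) corresponds to the null outcome.
y = {"a": Fraction(1, 4), "b": Fraction(1, 8), "c": Fraction(1, 8)}
x = thinned_distribution(y, Fraction(3, 10))
total = sum(x.values())
```

Every element ends up with probability exactly $\gamma/m$ (here 1/10), so X is non-null with probability $\gamma$ and uniform over the cluster when it is non-null, which is the form required by Lemma 10.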

Thus, we will take the index $s_i$ in Step 2(i) as the index of the set $T^{(i)}$ as
guaranteed by the Lemma above. Finally, by repeating the entire process $2^k$ times,
we make sure that we get a $(1 + \epsilon)$-approximate solution with high probability.

Note that the total running time of our algorithm is $O(nd \cdot 2^k \cdot 2^{\tilde{O}(k/\epsilon)})$.

Removing the $(k, \epsilon)$-irreducibility assumption: We now show how to
remove this assumption. First note that we have shown the following result.

Theorem 12. If a given point set is $\left(k, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible, then there is an
algorithm that gives a $\left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-approximation to the k-means objective
and that runs in time $O(nd \cdot 2^{\tilde{O}(k^2/\epsilon)})$.

Proof. The proof can be obtained by replacing $\epsilon$ by $\frac{\epsilon}{(1+\epsilon/2) \cdot k}$ in the above
analysis. □

Suppose the point set P is not $\left(k, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible. In that case it will
be sufficient to find fewer centers that $(1 + \epsilon)$-approximate the k-means objective.
The next theorem shows this more formally.
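Theorem 13. There is an algorithm that runs in time $O(nd \cdot 2^{\tilde{O}(k^2/\epsilon)})$ and
gives a $(1 + \epsilon)$-approximation to the k-means objective.

Proof. Let P denote the set of points. Let $1 < i \leq k$ be the largest index such
that P is $\left(i, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible. If no such i exists, then

\[ \Delta_1(P) \leq \left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)^k \cdot \Delta_k(P) \leq (1 + \epsilon) \cdot \Delta_k(P), \]

and so picking the centroid of P will give a $(1 + \epsilon)$-approximation.

Suppose such an i exists. In that case, we consider the i-means problem
and from the previous theorem we get that there is an algorithm that runs in
time $O(nd \cdot 2^i \cdot 2^{\tilde{O}(ki/\epsilon)})$ and gives a $\left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-approximation to the i-means
objective. Now, since P is not $\left(j, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible for any $i < j \leq k$, we have that

\[ \Delta_i(P) \leq \left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)^{k-i} \cdot \Delta_k(P), \]

so the cost of the centers returned is at most $\left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)^{k-i+1} \cdot \Delta_k(P) \leq (1 + \epsilon) \cdot \Delta_k(P)$.
Thus, we are done. □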



4 Other Distance Measures 

In the previous sections, we looked at the k-means problem where the
dissimilarity or distance measure was the square of the Euclidean distance. There are
numerous practical clustering problem instances where the dissimilarity
measure is not a function of the Euclidean distance. In many cases, the points are
not generated from a metric space. In these cases, it makes sense to talk about
the general k-median problem, which can be defined as follows:

Definition 14 (k-median with respect to a dissimilarity measure). Given a set
of n objects $P \subseteq X$ and a dissimilarity measure $D : X \times X \to \mathbb{R}_{\geq 0}$, find a
subset C of k objects (called medians) such that the following objective function
is minimized:

\[ \Delta(P, C) = \sum_{p \in P} \min_{c \in C} D(p, c) \]

In this section, we will show that our algorithm and analysis can be easily
generalized and extended to dissimilarity measures that satisfy some simple
properties. We will look at some interesting examples. We start by making the
observation that in the entire analysis of the previous section the only properties
of the distance measure that we used were given in Lemmas 3, 4 and 5. We also
used the symmetry property of the Euclidean metric implicitly. This motivates
us to consider dissimilarity measures on spaces where these lemmas (or mild
relaxations of these) are true. For such measures, we may replace $d(p, q)^2$ (this
is the square of the Euclidean distance) by $D(p, q)$ in all places in the previous
section and obtain a similar result. We will now formalize these ideas.

First, we will describe a property that captures Lemma 3. This is similar
to a definition by Ackermann et al. [3], who discuss PTAS for the k-median
problem with respect to metric and non-metric distance measures.
Definition 15 ($(f, \gamma, \delta)$-Sampling property). Given $0 < \gamma, \delta < 1$ and
$f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, a distance measure D over space X is said to have the $(f, \gamma, \delta)$-sampling
property if the following holds: for any set $P \subseteq X$, a uniformly random sample
S of $f(\gamma, \delta)$ points from P satisfies

\[ \Pr\left[ \sum_{p \in P} D(p, m(S)) \leq (1 + \gamma) \cdot \Delta_1(P) \right] \geq (1 - \delta), \]

where $m(S) = \frac{\sum_{s \in S} s}{|S|}$ denotes the mean of the points in S.

Definition 16 (Centroid property). A distance measure D over space X is said
to satisfy the centroid property if for any subset $P \subseteq X$ and any point $c \in X$,
we have:

\[ \sum_{p \in P} D(p, c) = \Delta_1(P) + |P| \cdot D(m(P), c), \]

where $m(P) = \frac{\sum_{p \in P} p}{|P|}$ denotes the mean of the points in P.

Definition 17 ($\alpha$-approximate triangle inequality). Given $\alpha \geq 1$, a distance
measure D over space X is said to satisfy the $\alpha$-approximate triangle inequality if
for any three points $p, q, r \in X$, $D(p, q) \leq \alpha \cdot (D(p, r) + D(r, q))$.

Definition 18 ($\beta$-approximate symmetry). Given $0 < \beta \leq 1$, a distance
measure D over space X is said to satisfy the $\beta$-symmetric property if for any pair of
points $p, q \in X$, $\beta \cdot D(q, p) \leq D(p, q) \leq \frac{1}{\beta} \cdot D(q, p)$.

The next theorem gives the generalization of our results for distance
measures that satisfy the above basic properties. The proof of this theorem follows
easily from the analysis in the previous section, and is given in Appendix A.

Theorem 19. Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Let $\alpha > 0$, $0 < \beta \leq 1$, and $0 < \delta < 1/2$
be constants and let $0 < \epsilon < 1/2$. Let $\eta = \frac{2\alpha^2}{\beta^2}(1 + 1/\beta)$. Let D be a distance
measure over space X that satisfies the following:

1. $\beta$-approximate symmetry property,

2. $\alpha$-approximate triangle inequality,

3. Centroid property, and

4. $(f, \epsilon, \delta)$-sampling property.

Then there is an algorithm that runs in time $O(nd \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta k,\, 0.2))})$ and gives a
$(1 + \epsilon)$-approximation to the k-median objective for any point set $P \subseteq X$, $|P| = n$.

The above theorem gives a characterization for when our non-uniform sam- 
pling based algorithm can be used to obtain a PTAS for a dissimilarity measure. 
The important question now is whether there exist interesting distance measures 
that satisfy the properties in the above Theorem. Next, we look at some distance 
measures other than squared Euclidean distance, that satisfy such properties. 
4.1 Mahalanobis distance

Here the domain is $\mathbb{R}^d$ and the distance is defined with respect to a positive
definite matrix $A \in \mathbb{R}^{d \times d}$. The distance between two points $p, q \in \mathbb{R}^d$ is given
by $D_A(p, q) = (p - q)^T \cdot A \cdot (p - q)$. Now, we discuss the properties in Theorem 19.

1. (Symmetry) For any pair of points $p, q \in \mathbb{R}^d$, we have $D_A(p, q) = D_A(q, p)$.
So, the $\beta$-approximate symmetry property holds for $\beta = 1$.

2. (Triangle inequality) [2] shows that the $\alpha$-approximate triangle inequality holds
for $\alpha = 2$.

3. (Centroid) The centroid property is shown to hold for Mahalanobis
distance in [ID].

4. (Sampling) [3] (see Corollary 3.7) shows that Mahalanobis distance satisfies
the $(f, \gamma, \delta)$-sampling property for $f(\gamma, \delta) = 1/(\gamma \delta)$.
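The first two properties can be spot-checked numerically. The sketch below computes the Mahalanobis distance as defined above for one symmetric positive definite matrix and three points; the function name `mahalanobis`, the matrix and the points are illustrative choices made here, and a single example is of course not a proof.

```python
def mahalanobis(p, q, A):
    """(p - q)^T A (p - q) for a matrix A given as a list of rows."""
    diff = [a - b for a, b in zip(p, q)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))


A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, positive definite (illustrative choice)
p, q, r = (0.0, 0.0), (3.0, 1.0), (1.0, 2.0)

d_pq = mahalanobis(p, q, A)    # 2*9 + 2*3 + 3*1 = 27
d_qp = mahalanobis(q, p, A)
d_pr = mahalanobis(p, r, A)    # 2*1 + 2*2 + 3*4 = 18
d_rq = mahalanobis(r, q, A)    # 2*4 - 2*2 + 3*1 = 7
```

Here `d_pq == d_qp` illustrates symmetry ($\beta = 1$), and $27 \leq 2 \cdot (18 + 7)$ is an instance of the 2-approximate triangle inequality.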

Using the above properties and Theorem 19, we get the following result.

Theorem 20 (k-median w.r.t. Mahalanobis distance). Let $0 < \epsilon < 1/2$.
There is an algorithm that runs in time $O(nd \cdot 2^{\tilde{O}(k^2/\epsilon)})$ and gives a
$(1 + \epsilon)$-approximation to the k-median objective function w.r.t. Mahalanobis distances
for any point set $P \subset \mathbb{R}^d$, $|P| = n$.

4.2 $\mu$-similar Bregman divergence

We start by defining Bregman divergence and then discuss the required
properties.

Definition 21 (Bregman Divergence). Let $\phi : X \to \mathbb{R}$ be a continuously
differentiable, real-valued and strictly convex function defined on a closed convex
set X. The Bregman distance associated with $\phi$ for points $p, q \in X$ is:

\[ D_\phi(p, q) = \phi(p) - \phi(q) - \nabla\phi(q)^T (p - q), \]

where $\nabla\phi(q)$ denotes the gradient of $\phi$ at point q.

Intuitively this can be thought of as the difference between the value of $\phi$
at point p and the value of the first-order Taylor expansion of $\phi$ around point
q evaluated at point p. Bregman divergence includes the following popular
distance measures:

• Euclidean distance. $D_\phi(p, q) = \|p - q\|^2$. Here $\phi(x) = \|x\|^2$.

• Kullback-Leibler divergence. $D_\phi(p, q) = \sum_i p_i \ln \frac{p_i}{q_i} - \sum_i (p_i - q_i)$. Here
$\phi(x) = \sum_i x_i \ln x_i - x_i$.

• Itakura-Saito divergence. $D_\phi(p, q) = \sum_i \left( \frac{p_i}{q_i} - \ln \frac{p_i}{q_i} - 1 \right)$. Here $\phi(x) = -\sum_i \ln x_i$.

• Mahalanobis distance. For a symmetric positive definite matrix $U \in \mathbb{R}^{d \times d}$,
the Mahalanobis distance is defined as: $D_U(p, q) = (p - q)^T U (p - q)$. Here
$\phi_U(x) = x^T U x$.
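The definition translates directly into code: given $\phi$ and its gradient, the divergence is the gap between $\phi(p)$ and the first-order expansion of $\phi$ around q. The sketch below is illustrative only; the function names are chosen here, and the gradients are supplied by hand rather than derived automatically.

```python
from math import log


def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    g = grad_phi(q)
    return phi(p) - phi(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))


def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2 * v for v in x]


def neg_entropy(x):
    # phi(x) = sum_i (x_i ln x_i - x_i), defined for positive coordinates
    return sum(v * log(v) - v for v in x)


def neg_entropy_grad(x):
    return [log(v) for v in x]


# Squared Euclidean distance: ||(1,2) - (3,1)||^2 = 4 + 1 = 5.
d_euclid = bregman(sq_norm, sq_norm_grad, (1.0, 2.0), (3.0, 1.0))

# Generalized Kullback-Leibler divergence, compared with its closed form.
p, q = (0.2, 0.8), (0.5, 0.5)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
kl_closed = sum(a * log(a / b) - (a - b) for a, b in zip(p, q))
d_kl_rev = bregman(neg_entropy, neg_entropy_grad, q, p)
```

The last value differs from `d_kl`, a reminder that a Bregman divergence need not be symmetric.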
Bregman divergences have been shown to satisfy the Centroid property by
Banerjee et al. [TU]. Not all Bregman divergences satisfy the
symmetry property or the triangle inequality. So, we cannot hope to use our
results for the class of all Bregman divergences. On the other hand, some of
the Bregman divergences that are used in practice satisfy a property called
$\mu$-similarity (see pQ for an overview of such Bregman divergences). Next, we give
the definition of $\mu$-similarity.

Definition 22 ($\mu$-similar Bregman divergence). A Bregman divergence $D_\phi$ on
domain $X \subseteq \mathbb{R}^d$ is called $\mu$-similar for constant $0 < \mu \leq 1$, if there exists a
symmetric positive definite matrix U such that for Mahalanobis distance $D_U$
and for each $p, q \in X$ we have:

\[ \mu \cdot D_U(p, q) \leq D_\phi(p, q) \leq D_U(p, q). \]

Now, a $\mu$-similar Bregman divergence can easily be shown to satisfy approximate
symmetry and triangle inequality properties. This is formalized in the
following simple lemma. The proof of this lemma is given in Appendix B.

Lemma 23. Let $0 < \mu \le 1$. Any $\mu$-similar Bregman divergence satisfies the
$\mu$-approximate symmetry property and $(2/\mu)$-approximate triangle inequality.
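As a sanity check of Lemma 23, here is a small numerical sketch under assumptions that are not taken from the text: the generalized Kullback-Leibler divergence restricted to the hypothetical domain $[\lambda, \upsilon]^d$ with $\lambda = 0.5$, $\upsilon = 2$. By Taylor's theorem (the Hessian of $\phi$ is $\mathrm{diag}(1/x_i)$), this divergence is $\mu$-similar with $\mu = \lambda/\upsilon$ and $U = I/(2\lambda)$. The code tests the sandwich inequality, the $\mu$-approximate symmetry and the $(2/\mu)$-approximate triangle inequality on random points.

```python
import math
import random

def kl(p, q):
    """Generalized KL divergence: sum_i p_i ln(p_i/q_i) - (p_i - q_i)."""
    return sum(a * math.log(a / b) - (a - b) for a, b in zip(p, q))

def mahalanobis(p, q, scale):
    """D_U(p, q) for U = scale * identity."""
    return scale * sum((a - b) ** 2 for a, b in zip(p, q))

lo, hi = 0.5, 2.0          # hypothetical domain [lo, hi]^d (an assumption)
mu = lo / hi               # similarity constant on this domain
scale = 1.0 / (2.0 * lo)   # U = I / (2 * lo)

rng = random.Random(0)

def point(d=3):
    return [rng.uniform(lo, hi) for _ in range(d)]

tol = 1e-12  # slack for floating-point rounding
checks = []
for _ in range(1000):
    p, q, r = point(), point(), point()
    d_u = mahalanobis(p, q, scale)
    similar = mu * d_u - tol <= kl(p, q) <= d_u + tol
    symmetric = mu * kl(q, p) - tol <= kl(p, q) <= kl(q, p) / mu + tol
    triangle = (mu / 2) * kl(p, r) <= kl(p, q) + kl(q, r) + tol
    checks.append(similar and symmetric and triangle)
all_hold = all(checks)
```

A random test of this kind cannot prove the lemma, but a single failing triple would expose an incorrect choice of $\mu$ or $U$.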

Finally, we use the sampling property from Ackermann et al. [3], who show
that any $\mu$-similar Bregman divergence satisfies the $(f, \gamma, \delta)$-sampling property
for $f(\gamma, \delta) = \frac{1}{\mu\gamma\delta}$.

Using all the results mentioned above we get the following Theorem for
$\mu$-similar Bregman divergences.

Theorem 24 ($k$-median w.r.t. $\mu$-similar Bregman divergences). Let $0 < \mu \le 1$
and $0 < \epsilon \le 1/2$. There is an algorithm that runs in time
$O\left(nd \cdot 2^{\tilde{O}(k^2/(\mu \cdot \epsilon))}\right)$ and
gives a $(1+\epsilon)$-approximation to the $k$-median objective function w.r.t. $\mu$-similar
Bregman divergence for any point set $P \subset \mathbb{R}^d$, $|P| = n$.
A Proof of Theorem 19

Here we give a proof of Theorem 19. For the proof, we repeat the analysis in
Section 3 almost word-by-word. One of the main things we will be doing here is
replacing all instances of $d(p, q)^2$ in Section 3 with $D(p, q)$. So, this section will
look very similar to Section 3. First we will restate Theorem 19.

Theorem (Restatement of Theorem 19). Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Let $\alpha > 0$,
$0 < \beta \le 1$, and $0 < \delta < 1/2$ be constants and let $0 < \epsilon \le 1/2$. Let
$\eta = \frac{2\alpha^2}{\beta^2}(1 + 1/\beta)$.
Let $D$ be a distance measure over space $X$ that satisfies the following:

1. $\beta$-approximate symmetry property,

2. $\alpha$-approximate triangle inequality,

3. Centroid property, and

4. $(f, \gamma, \delta)$-sampling property.

Then there is an algorithm that runs in time $O\left(nd \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta k,\, 0.2))}\right)$ and gives a
$(1+\epsilon)$-approximation to the $k$-median objective for any point set $P \subseteq X$, $|P| = n$.

We will first assume that the instance is $(k, \epsilon)$-irreducible for a suitably small
parameter $\epsilon$. We shall then get rid of this assumption later as we did in Section 3.
The algorithm remains the same and is described in Figure 2.

We develop some notation first. For the rest of the analysis, we will fix a
tuple $(s_1, \ldots, s_k)$ - this will be the "desired tuple", i.e., the one for which we
can show that the set $C$ gives a good solution. As our analysis proceeds, we
will argue what properties this tuple should have. Let $C^{(i)}$ be the set $C$ at the
beginning of the $i$-th iteration of Step 2(b). To begin with, $C^{(0)}$ is empty. Let
$S^{(i)}$ be the set $S$ sampled during the $i$-th iteration of Step 2(b), and $T^{(i)}$ be the
corresponding set $T$ (which is the $s_i$-th subset of $S^{(i)}$).

Let $O_1, \ldots, O_k$ be the optimal clusters, and $c_1, \ldots, c_k$ denote the respective
optimal cluster centers. Further, let $m_i$ denote $|O_i|$, and wlog assume that
$m_1 \ge \ldots \ge m_k$. Let $r_i$ denote the average cost paid by a point in $O_i$, i.e.,

$$r_i = \frac{\sum_{p \in O_i} D(p, c_i)}{m_i}.$$
Find-k-median(P)

Let $\eta = \frac{2\alpha^2}{\beta^2}(1 + 1/\beta)$, $N = \frac{(24\eta\alpha\beta k) \cdot f(\epsilon/\eta,\, 0.2)}{\epsilon^2}$, $M = f(\epsilon/\eta, 0.2)$, and $P = \binom{N}{M}$

1. Repeat $2^k$ times and output the set of centers $C$ that gives least cost
   2. Repeat for all $k$-tuples $(s_1, \ldots, s_k) \in [P] \times [P] \times \ldots \times [P]$ and
      pick the set of centers $C$ that gives least cost

      (a) $C \leftarrow \{\}$

      (b) For $i \leftarrow 1$ to $k$

            Sample a set $S$ of $N$ points with $D^2$-sampling (w.r.t. centers $C$)
            Let $T$ be the $s_i$-th subset of $S$. ^a
            $C \leftarrow C \cup \{m(T)\}$. ^b

^a For a set of size $N$ we consider an arbitrary ordering of the subsets of size $M$ of this set.
^b $m(T)$ denotes the centroid of the points in $T$.

Figure 2: The algorithm that gives $(1+\epsilon)$-approximation for any $(k, \epsilon)$-
irreducible data set. Note that the inner loop is executed at most
$2^k \cdot \binom{N}{M}^k \sim 2^k \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta,\, 0.2))}$ times.
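The sampling step of Figure 2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the full algorithm: the function names `d_sample`, `centroid` and `one_pass` are hypothetical, squared Euclidean distance stands in for a general $D$, and `one_pass` fixes a single choice of the $M$-subset (the first $M$ sampled points) instead of enumerating all $k$-tuples $(s_1, \ldots, s_k)$.

```python
import random

def d_sample(points, centers, dist, n_samples, rng):
    """Draw n_samples points independently, each chosen with probability
    proportional to its distance to the nearest current center.
    With no centers yet, the draw is uniform.
    Raises ValueError if every point has zero cost."""
    if not centers:
        weights = [1.0] * len(points)
    else:
        weights = [min(dist(p, c) for c in centers) for p in points]
    return rng.choices(points, weights=weights, k=n_samples)

def centroid(points):
    """Coordinate-wise mean m(T) of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def sq_euclid(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def one_pass(points, k, dist, n, m, rng):
    """One execution of steps (a)-(b) for a single fixed subset choice."""
    centers = []
    for _ in range(k):
        sample = d_sample(points, centers, dist, n, rng)
        subset = sample[:m]          # stands in for "the s_i-th subset of S"
        centers.append(centroid(subset))
    return centers

rng = random.Random(1)
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
sample = d_sample(pts, [(0.0, 0.0)], sq_euclid, 50, rng)
centers = one_pass(pts, 2, sq_euclid, 6, 2, rng)
```

In the usage lines, the point that coincides with the existing center has weight zero and is therefore never drawn, while the point at squared distance 9 is drawn nine times as often as the point at squared distance 1.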



First, we show that any two optimal centers are far enough apart.

Lemma 25. For any $1 \le i < j \le k$,

$$D(c_j, c_i) \ge \epsilon \cdot (r_i + r_j).$$

Proof. Since $i < j$, we have $m_i \ge m_j$. For the sake of contradiction assume
$D(c_j, c_i) < \epsilon \cdot (r_i + r_j)$. Then we have,

$$\begin{aligned}
\Delta(O_i \cup O_j, \{c_i\}) &= m_i \cdot r_i + m_j \cdot r_j + m_j \cdot D(c_j, c_i) && \text{(using Centroid property)} \\
&< m_i \cdot r_i + m_j \cdot r_j + m_j \cdot \epsilon \cdot (r_i + r_j) \\
&\le (1+\epsilon) \cdot m_i \cdot r_i + (1+\epsilon) \cdot m_j \cdot r_j && \text{(since } m_i \ge m_j) \\
&\le (1+\epsilon) \cdot \Delta(O_i \cup O_j, \{c_i, c_j\})
\end{aligned}$$

This implies that the centers $\{c_1, \ldots, c_k\} \setminus \{c_j\}$ give a $(1+\epsilon)$-approximation to the
$k$-median objective. This contradicts the assumption that $P$ is $(k, \epsilon)$-irreducible.

□

The above lemma gives the following Corollary that we will use in the rest 
of the proof. 

Corollary 26. For any $i \ne j$, $D(c_i, c_j) \ge (\beta\epsilon) \cdot (r_i + r_j)$.

Proof. If $i > j$, then we have $D(c_i, c_j) \ge \epsilon \cdot (r_i + r_j)$ from the above lemma and
hence $D(c_i, c_j) \ge (\beta\epsilon) \cdot (r_i + r_j)$. In case $i < j$, then the above lemma gives
$D(c_j, c_i) \ge \epsilon \cdot (r_i + r_j)$. Using the $\beta$-approximate symmetry property we get the
statement of the corollary. □

We give an outline of the proof. Suppose in the first $(i-1)$ iterations, we have
found centers which are close to the centers of some $(i-1)$ clusters in the optimal
solution. Conditioned on this fact, we show that in the next iteration, we are
likely to sample a sufficient number of points from one of the remaining clusters
(c.f. Corollary 28). Further, we show that the samples from this new cluster are
close to the uniform distribution (c.f. Lemma 29). Since such a sample does not
come from an exactly uniform distribution, we cannot use the $(f, \gamma, \delta)$-sampling
property directly. In fact, dealing with the slight non-uniformity turns out to
be non-trivial (c.f. Lemmas 30 and 31).

We now show that the following invariant will hold for all iterations: let
$C^{(i-1)}$ consist of centers $c'_1, \ldots, c'_{i-1}$ (added in this order). Then, with
probability at least $\frac{1}{2^{i-1}}$, there exist distinct indices $j_1, \ldots, j_{i-1}$ such that for all
$l = 1, \ldots, i-1$,

$$\Delta(O_{j_l}, c'_l) \le (1 + \epsilon/\eta) \cdot \Delta(O_{j_l}, c_{j_l}) \qquad (6)$$

where $\eta$ is a fixed constant that depends on $\alpha$ and $\beta$. With foresight, we fix the
value of $\eta = \frac{2\alpha^2}{\beta^2} \cdot (1 + 1/\beta)$. Suppose this invariant holds for $C^{(i-1)}$ (the base
case is easy since $C^{(0)}$ is empty). We now show that this invariant holds for $C^{(i)}$
as well. In other words, we just show that in the $i$-th iteration, with probability
at least $1/2$, the algorithm finds a center $c'_i$ such that

$$\Delta(O_{j_i}, c'_i) \le (1 + \epsilon/\eta) \cdot \Delta(O_{j_i}, c_{j_i}),$$

where $j_i$ is an index distinct from $\{j_1, \ldots, j_{i-1}\}$. This will basically show that
at the end of the last iteration, we will have $k$ centers that give a $(1+\epsilon)$-
approximation with probability at least $2^{-k}$.

We now show that the invariant holds for $C^{(i)}$. We use the notation developed
above for $C^{(i-1)}$. Let $I$ denote the set of indices $\{j_1, \ldots, j_{i-1}\}$. Now let
$j_i$ be the index $j \notin I$ for which $\Delta(O_j, C^{(i-1)})$ is maximum. Intuitively,
conditioned on sampling from the clusters $O_j$, $j \notin I$, using $D^2$-sampling, it is likely
that enough points from $O_{j_i}$ will be sampled. The next lemma shows that there
is a good chance that elements from the sets $O_j$ for $j \notin I$ will be sampled.

Lemma 27.

$$\frac{\sum_{l \notin I} \Delta(O_l, C^{(i-1)})}{\sum_{l=1}^{k} \Delta(O_l, C^{(i-1)})} \ge \frac{\epsilon}{2}$$

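Proof. Suppose, for the sake of contradiction, the above statement does not
hold. Then,

$$\begin{aligned}
\Delta(P, C^{(i-1)}) &= \sum_{l \in I} \Delta(O_l, C^{(i-1)}) + \sum_{l \notin I} \Delta(O_l, C^{(i-1)}) \\
&< \sum_{l \in I} \Delta(O_l, C^{(i-1)}) + \frac{\epsilon/2}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta(O_l, C^{(i-1)}) && \text{(by our assumption)} \\
&= \frac{1}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta(O_l, C^{(i-1)}) \\
&\le \frac{1 + \epsilon/\eta}{1 - \epsilon/2} \cdot \sum_{l \in I} \Delta_1(O_l) && \text{(using the invariant for } C^{(i-1)}) \\
&\le (1+\epsilon) \cdot \sum_{l \in I} \Delta_1(O_l) && \text{(using } \eta = (2\alpha^2/\beta^2) \cdot (1 + 1/\beta) \ge 4) \\
&\le (1+\epsilon) \cdot \sum_{l \in [k]} \Delta_1(O_l)
\end{aligned}$$

But this contradicts the fact that $P$ is $(k, \epsilon)$-irreducible. □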
We get the following corollary easily.

Corollary 28.

$$\frac{\Delta(O_{j_i}, C^{(i-1)})}{\sum_{l=1}^{k} \Delta(O_l, C^{(i-1)})} \ge \frac{\epsilon}{2k}$$

The above Lemma and its Corollary say that with probability at least $\frac{\epsilon}{2k}$,
points in the set $O_{j_i}$ will be sampled. However, the points within $O_{j_i}$ are not
sampled uniformly. Some points in $O_{j_i}$ might be sampled with higher probability
than other points. In the next lemma, we show that each point will be sampled
with a certain minimum probability.

Lemma 29. For any $l \notin I$ and any point $p \in O_l$,

$$\frac{D(p, C^{(i-1)})}{\Delta(O_l, C^{(i-1)})} \ge \frac{1}{m_l} \cdot \frac{\epsilon}{3\alpha\beta\eta}.$$

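Proof. Fix a point $p \in O_l$. Let $j_t \in I$ be the index such that $p$ is closest to $c'_t$
among all centers in $C^{(i-1)}$. We have

$$\begin{aligned}
\Delta(O_l, C^{(i-1)}) &\le m_l \cdot r_l + m_l \cdot D(c_l, c'_t) && \text{(using Centroid property)} \\
&\le m_l \cdot r_l + \alpha \cdot m_l \cdot \left(D(c_l, c_{j_t}) + D(c_{j_t}, c'_t)\right) && \text{(using triangle inequality)} \\
&\le m_l \cdot r_l + \alpha \cdot m_l \cdot \left(D(c_l, c_{j_t}) + \frac{\epsilon \cdot r_{j_t}}{\eta}\right), && (7)
\end{aligned}$$

where the last inequality follows from the invariant condition for $C^{(i-1)}$. Also,
we know that the following inequalities hold:

$$\begin{aligned}
\alpha \cdot \left(D(p, c'_t) + D(c'_t, c_{j_t})\right) &\ge D(p, c_{j_t}) && \text{(from approximate triangle inequality)} && (8) \\
\alpha \cdot \left(D(c_l, p) + D(p, c_{j_t})\right) &\ge D(c_l, c_{j_t}) && \text{(from approximate triangle inequality)} && (9) \\
D(p, c_l) &\le D(p, c_{j_t}) && \text{(since } p \in O_l) && (10) \\
\beta \cdot D(c_l, p) \le D(p, c_l) &\le (1/\beta) \cdot D(c_l, p) && \text{(from approximate symmetry)} && (11) \\
D(c_{j_t}, c'_t) &\le (\epsilon/\eta) \cdot r_{j_t} && \text{(from invariant condition)} && (12) \\
\beta \cdot D(c_{j_t}, c'_t) \le D(c'_t, c_{j_t}) &\le (1/\beta) \cdot D(c_{j_t}, c'_t) && \text{(from approximate symmetry)} && (13)
\end{aligned}$$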

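Inequalities (9), (10), and (11) give the following:

$$\begin{aligned}
& D(p, c_{j_t}) + D(c_l, p) \ge \frac{D(c_l, c_{j_t})}{\alpha} \\
\Rightarrow\ & D(p, c_{j_t}) + \frac{D(p, c_l)}{\beta} \ge \frac{D(c_l, c_{j_t})}{\alpha} && \text{(using (11))} \\
\Rightarrow\ & D(p, c_{j_t}) + \frac{D(p, c_{j_t})}{\beta} \ge \frac{D(c_l, c_{j_t})}{\alpha} && \text{(using (10))} \\
\Rightarrow\ & D(p, c_{j_t}) \ge \frac{D(c_l, c_{j_t})}{\alpha (1 + 1/\beta)} && (14)
\end{aligned}$$

Using (8) and (14), we get the following:

$$D(p, c'_t) \ge \frac{D(c_l, c_{j_t})}{\alpha^2 (1 + 1/\beta)} - D(c'_t, c_{j_t})$$

Using the previous inequality and (13) we get the following:

$$\begin{aligned}
D(p, c'_t) &\ge \frac{D(c_l, c_{j_t})}{\alpha^2 (1 + 1/\beta)} - \frac{D(c_{j_t}, c'_t)}{\beta} \\
&\ge \frac{D(c_l, c_{j_t})}{\alpha^2 (1 + 1/\beta)} - \frac{\epsilon}{\eta\beta} \cdot r_{j_t} && \text{(using invariant for } C^{(i-1)}) \\
&\ge \frac{D(c_l, c_{j_t})}{\eta\beta^2} && \text{(using Corollary 26)} && (15)
\end{aligned}$$

So, we get

$$\begin{aligned}
\frac{D(p, C^{(i-1)})}{\Delta(O_l, C^{(i-1)})} &\ge \frac{D(c_l, c_{j_t})}{(\eta\beta^2) \cdot m_l \cdot \left(r_l + \alpha \left(D(c_l, c_{j_t}) + \frac{\epsilon \cdot r_{j_t}}{\eta}\right)\right)} && \text{(using (7) and (15))} \\
&\ge \frac{1}{(\eta\beta^2) \cdot m_l} \cdot \frac{1}{1/(\beta\epsilon) + \alpha + \alpha/(\eta\beta)} && \text{(using Corollary 26)} \\
&\ge \frac{\epsilon}{3\alpha\beta\eta} \cdot \frac{1}{m_l}
\end{aligned}$$

□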



Recall that $S^{(i)}$ is the sample of size $N$ in this iteration. We would like to
show that the invariant will hold in this iteration as well. We first prove a
simple corollary of Lemma [3]

Lemma 30. Let $Q$ be a set of $n$ points, and $\gamma$ be a parameter, $0 < \gamma < 1$.
Define a random variable $X$ as follows: with probability $\gamma$, it picks an element
of $Q$ uniformly at random, and with probability $1 - \gamma$, it does not pick any
element (i.e., is null). Let $X_1, \ldots, X_\ell$ be $\ell$ independent copies of $X$, where
$\ell = \frac{4}{\gamma} \cdot f(\epsilon/\eta, 0.2)$. Let $T$ denote the (multi-set) of elements of $Q$ picked by
$X_1, \ldots, X_\ell$. Then, with probability at least $3/4$, $T$ contains a subset $U$ of size
$f(\epsilon/\eta, 0.2)$ which satisfies

$$\Delta(Q, m(U)) \le \left(1 + \frac{\epsilon}{\eta}\right) \cdot \Delta_1(Q) \qquad (16)$$

Proof. Define a random variable $I$, which is a subset of the index set $\{1, \ldots, \ell\}$,
as follows: $I = \{t : X_t$ picks an element of $Q$, i.e., it is not null$\}$. Conditioned
on $I = \{t_1, \ldots, t_r\}$, note that the random variables $X_{t_1}, \ldots, X_{t_r}$ are independent
uniform samples from $Q$. Thus if $|I| \ge f(\epsilon/\eta, 0.2)$, then the sampling property w.r.t.
$D$ implies that with probability at least 0.8, the desired event (16) happens. But
the expected value of $|I|$ is $4 \cdot f(\epsilon/\eta, 0.2)$, and so, $|I| \ge f(\epsilon/\eta, 0.2)$ with high
probability, and hence, the statement in the lemma is true. □
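The "high probability" step in the proof of Lemma 30 can be made concrete with an exact binomial tail. The sketch below is illustrative only: the values chosen for $f(\epsilon/\eta, 0.2)$ and $\gamma$ are hypothetical, and `prob_at_least` is a helper name introduced here. It computes the probability that at least $f$ of the $\ell = (4/\gamma) \cdot f$ trials are non-null and multiplies it by the 0.8 guarantee of the sampling property.

```python
from math import comb

def prob_at_least(trials, gamma, need):
    """P[Binomial(trials, gamma) >= need], computed exactly."""
    below = sum(comb(trials, j) * gamma**j * (1 - gamma)**(trials - j)
                for j in range(need))
    return 1.0 - below

f_val = 20     # hypothetical value of f(eps/eta, 0.2)
gamma = 0.05   # hypothetical probability that X picks an element
trials = round(4 * f_val / gamma)   # ell = (4 / gamma) * f
enough = prob_at_least(trials, gamma, f_val)   # P[|I| >= f]
overall = 0.8 * enough   # lower bound on the success probability in Lemma 30
```

For these values the number of non-null picks has mean $4f = 80$, so falling below $f = 20$ is very unlikely and the product stays above the $3/4$ claimed in the lemma.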

We are now ready to prove the main lemma. 
Lemma 31. With probability at least $1/2$, there exists a subset $T^{(i)}$ of $S^{(i)}$ of
size at most $f(\epsilon/\eta, 0.2)$ such that

$$\Delta(O_{j_i}, m(T^{(i)})) \le \left(1 + \frac{\epsilon}{\eta}\right) \cdot \Delta_1(O_{j_i}).$$

Proof. Recall that $S^{(i)}$ contains $N = \frac{(24\eta\alpha\beta k) \cdot f(\epsilon/\eta,\, 0.2)}{\epsilon^2}$ independent samples of
$P$ (using $D^2$-sampling). We are interested in $S^{(i)} \cap O_{j_i}$. Let $Y_1, \ldots, Y_N$ be
$N$ independent random variables defined as follows: for any $t$, $1 \le t \le N$,
$Y_t$ picks an element of $P$ using $D^2$-sampling with respect to $C^{(i-1)}$. If this
element is not in $O_{j_i}$, it just discards it (i.e., $Y_t$ is null). Let $\gamma$ denote $\frac{\epsilon^2}{6\alpha\beta\eta k}$.
Corollary 28 and Lemma 29 imply that $Y_t$ picks a particular element of $O_{j_i}$
with probability at least $\frac{\gamma}{m_{j_i}}$. We would now like to apply Lemma 30 (observe
that $N = \frac{4}{\gamma} \cdot f(\epsilon/\eta, 0.2)$). We can do this by a simple coupling argument as
follows. For a particular element $p \in O_{j_i}$, suppose $Y_t$ assigns probability $\frac{\gamma(p)}{m_{j_i}}$ to
it. One way of sampling a random variable $X_t$ as in Lemma 30 is as follows -
first sample using $Y_t$. If $Y_t$ is null, then $X_t$ is also null. Otherwise, suppose $Y_t$
picks an element $p$ of $O_{j_i}$. Then $X_t$ is equal to $p$ with probability $\frac{\gamma}{\gamma(p)}$, and null
otherwise. It is easy to check that with probability $\gamma$, $X_t$ is a uniform sample
from $O_{j_i}$, and null with probability $1 - \gamma$. Now, observe that the set of elements
of $O_{j_i}$ sampled by $Y_1, \ldots, Y_N$ is always a superset of $X_1, \ldots, X_N$. We can now
use Lemma 30 to finish the proof. □
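The coupling argument can be illustrated with exact arithmetic. In the sketch below, the pick probabilities of $Y_t$ and the value of $\gamma$ are hypothetical, and `thinned_distribution` is a name introduced only for this illustration. Each pick of an element $p$ is kept with probability $(\gamma/m)$ divided by the probability that $Y_t$ assigns to $p$, which is the ratio $\gamma/\gamma(p)$ from the proof.

```python
from fractions import Fraction

def thinned_distribution(y_probs, gamma):
    """y_probs[p] is the probability that Y_t picks cluster element p.
    X_t keeps a pick of p with probability (gamma/m) / y_probs[p] and is
    null otherwise. Returns the distribution of X_t over the cluster
    elements together with the probability that X_t is null."""
    m = len(y_probs)
    target = gamma / m
    # The coupling needs every pick probability to be at least gamma/m.
    assert all(pr >= target for pr in y_probs.values())
    x_dist = {p: pr * (target / pr) for p, pr in y_probs.items()}
    return x_dist, 1 - sum(x_dist.values())

# Hypothetical non-uniform pick probabilities for a cluster of 4 points;
# the remaining mass (3/10) is the chance that Y_t lands outside the cluster.
y_probs = {"a": Fraction(1, 10), "b": Fraction(1, 5),
           "c": Fraction(3, 20), "d": Fraction(1, 4)}
gamma = Fraction(2, 5)   # gamma / m = 1/10, at most every pick probability
x_dist, x_null = thinned_distribution(y_probs, gamma)
```

After thinning, every element is selected with the same probability $\gamma/m$ and the null outcome has probability $1 - \gamma$, so $X_t$ has the distribution required by Lemma 30, and every non-null $X_t$ is also a pick of $Y_t$.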

Thus, we will take the index $s_i$ in Step 2(i) as the index of the set $T^{(i)}$ as
guaranteed by the Lemma above. Finally, by repeating the entire process $2^k$ times,
we make sure that we get a $(1+\epsilon)$-approximate solution with high probability.

Note that the total running time of our algorithm is $O\left(nd \cdot 2^k \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta,\, 0.2))}\right)$.

Removing the $(k, \epsilon)$-irreducibility assumption: We now show how to
remove this assumption. First note that we have shown the following result.

Theorem 32. If a given point set is $\left(k, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible, then there is an
algorithm that gives a $\left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-approximation to the $k$-median objective
with respect to distance measure $D$ and that runs in time $O\left(nd \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta k,\, 0.2))}\right)$.

Proof. The proof can be obtained by replacing $\epsilon$ by $\frac{\epsilon}{(1+\epsilon/2) \cdot k}$ in the above
analysis. □

Suppose the point set $P$ is not $\left(k, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible. In that case it
will be sufficient to find fewer centers that $(1+\epsilon)$-approximate the $k$-median
objective. The next theorem shows this more formally.

Theorem 33. There is an algorithm that runs in time $O\left(nd \cdot 2^{\tilde{O}(k \cdot f(\epsilon/\eta k,\, 0.2))}\right)$
and gives a $(1+\epsilon)$-approximation to the $k$-median objective with respect to $D$.

Proof. Let $P$ denote the set of points. Let $1 < i \le k$ be the largest index such
that $P$ is $\left(i, \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-irreducible. If no such $i$ exists, then

$$\Delta_1(P) \le \left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)^k \cdot \Delta_k(P) \le (1+\epsilon) \cdot \Delta_k(P),$$

and so picking the centroid of $P$ will give a $(1+\epsilon)$-approximation.

Suppose such an $i$ exists. In that case, we consider the $i$-median problem
and from the previous theorem we get that there is an algorithm that runs in
time $O\left(nd \cdot 2^i \cdot 2^{\tilde{O}(i \cdot f(\epsilon/\eta k,\, 0.2))}\right)$ and gives a $\left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)$-approximation to the
$i$-median objective. Now we have that

$$\Delta_i(P) \le \left(1 + \frac{\epsilon}{(1+\epsilon/2) \cdot k}\right)^{k-i} \cdot \Delta_k(P) \le (1+\epsilon) \cdot \Delta_k(P).$$

Thus, we are done. □



B Proof of Lemma 23

Here we give the proof of Lemma 23. For better readability, we first restate the
Lemma.

Lemma (Restatement of Lemma 23). Let $0 < \mu \le 1$. Any $\mu$-similar Bregman
divergence satisfies the $\mu$-approximate symmetry property and $(2/\mu)$-approximate
triangle inequality.

The above lemma follows from the next two sub-lemmas.

Lemma 34 (Symmetry for $\mu$-similar Bregman divergence). Let $0 < \mu \le 1$.
Consider a $\mu$-similar Bregman divergence $D_\phi$ on domain $X \subseteq \mathbb{R}^d$. For any two
points $p, q \in X$, we have: $\mu \cdot D_\phi(q, p) \le D_\phi(p, q) \le \frac{1}{\mu} \cdot D_\phi(q, p)$.

Proof. Using equation (5) we get the following:

$$\mu \cdot D_\phi(q, p) \le \mu \cdot D_U(q, p) = \mu \cdot D_U(p, q) \le D_\phi(p, q) \le D_U(p, q) = D_U(q, p) \le \frac{1}{\mu} \cdot D_\phi(q, p).$$

□

Lemma 35 (Triangle inequality for $\mu$-similar Bregman divergence). Let
$0 < \mu \le 1$. Consider a $\mu$-similar Bregman divergence $D_\phi$ on domain $X \subseteq \mathbb{R}^d$. For
any three points $p, q, r \in X$, we have: $(\mu/2) \cdot D_\phi(p, r) \le D_\phi(p, q) + D_\phi(q, r)$.

Proof. We have:

$$\begin{aligned}
D_\phi(p, q) + D_\phi(q, r) &\ge \mu \cdot \left(D_U(p, q) + D_U(q, r)\right) \\
&\ge (\mu/2) \cdot D_U(p, r) \\
&\ge (\mu/2) \cdot D_\phi(p, r)
\end{aligned}$$

The first and third inequalities use equation (5) and the second inequality
uses the approximate triangle inequality for Mahalanobis distance. □